
Conversation

@mikkihugo
Contributor

@mikkihugo mikkihugo commented Nov 12, 2025

Overview

This PR establishes a complete automated workflow for synchronizing GitHub Linguist data and publishing language snapshots. It includes comprehensive pattern sync (Phases 2-4), automated workflows, and publishing infrastructure.

Key Features

🔄 Linguist Sync Tool (Phases 2, 3 & 4)

Phase 2: File Classification

  • 167 vendored code patterns from vendor.yml
  • 82 generated file patterns from generated.rb
  • Automatic pattern extraction and Rust code generation

Phase 3: Language Detection Heuristics

  • 124 disambiguation groups from heuristics.yml
  • 21 named pattern definitions
  • Complete rule-based language detection for ambiguous extensions

Phase 4: Language Metadata ⭐ NEW

  • Full metadata for 789 languages from languages.yml
  • Extensions, filenames, interpreters
  • Syntax highlighting modes (ace_mode, tm_scope, codemirror)
  • Visual metadata (colors, aliases)
  • Language categorization and editor config
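For orientation, here is a minimal sketch of what one generated metadata entry could look like. The `LanguageMetadata` and `LANGUAGES` names come from the generated file described later in this PR; the exact field set, types, and the sample values for Rust are illustrative assumptions, not the generated code itself.

// Sketch only: field names mirror Linguist's languages.yml keys; the real
// definitions live in src/languages_metadata_generated.rs and may differ.
pub struct LanguageMetadata {
    pub name: &'static str,
    pub language_type: &'static str, // "programming", "markup", "data", "prose"
    pub extensions: &'static [&'static str],
    pub filenames: &'static [&'static str],
    pub interpreters: &'static [&'static str],
    pub aliases: &'static [&'static str],
    pub color: Option<&'static str>,
    pub ace_mode: &'static str,
    pub tm_scope: &'static str,
    pub codemirror_mode: Option<&'static str>,
    pub group: Option<&'static str>,
    pub wrap: bool,
    pub fs_name: Option<&'static str>,
}

// One hypothetical entry of the generated LANGUAGES const.
pub const LANGUAGES: &[LanguageMetadata] = &[LanguageMetadata {
    name: "Rust",
    language_type: "programming",
    extensions: &[".rs", ".rs.in"],
    filenames: &[],
    interpreters: &[],
    aliases: &["rs"],
    color: Some("#dea584"),
    ace_mode: "rust",
    tm_scope: "source.rust",
    codemirror_mode: Some("rust"),
    group: None,
    wrap: false,
    fs_name: None,
}];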

📦 Generated Files

  • src/file_classifier_generated.rs (7.8K)
  • src/heuristics_generated.rs (117K)
  • src/languages_metadata_generated.rs (448K) ⭐ NEW
  • .github/linguist/languages.yml (154K) ⭐ NEW

🤖 Automated Workflows

sync-linguist.yml

  • Triggers on Renovate PRs for Linguist updates
  • Auto-generates all pattern files
  • Runs tests and commits changes
  • Posts PR comment with sync summary

publish-snapshot.yml

  • Triggers on push to main
  • Generates canonical snapshot from languages.yml
  • Validates JSON structure
  • Creates PR with updated snapshot

validate-snapshot.yml

  • Validates snapshots in all PRs
  • Ensures JSON integrity
  • Runs tests with generated snapshots

publish-docs.yml

  • Publishes Rust docs to GitHub Pages
  • Triggers on push to main

Technical Improvements

Sync Tool Refactor

  • Replaced regex-based parsing with proper serde YAML deserialization
  • Added structured types for all Linguist data models
  • Improved error handling with anyhow::Context
  • Added comprehensive logging with env_logger
  • Direct file writing (no stdout redirection)

Architecture

Provides both:

  1. Rust const data - Embedded in binary for performance
  2. Raw YAML - For external tooling and snapshot generation

This gives downstream consumers flexibility in how they integrate the data.
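A rough sketch of the two consumption paths (the YAML location and the generated file are taken from the file list above; the module path, field names, and the anyhow/serde_yaml usage are assumptions):

use anyhow::{Context, Result};

// Path 1: code inside the crate uses the embedded const table directly, e.g.
// languages_metadata_generated::LANGUAGES.iter().find(|l| l.name == "Rust")
// (module and field names are assumptions based on the generated file name).
//
// Path 2: external tooling reads the raw YAML committed by the sync tool:
fn load_raw_languages_yaml() -> Result<serde_yaml::Value> {
    let raw = std::fs::read_to_string(".github/linguist/languages.yml")
        .context("failed to read .github/linguist/languages.yml")?;
    serde_yaml::from_str(&raw).context("failed to parse languages.yml")
}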

Testing

✅ All 17 tests pass
✅ Clippy passes with pedantic + nursery lints
✅ Pre-commit/pre-push hooks pass
✅ Sync tool successfully fetches and parses latest Linguist data

Migration Notes

No breaking changes. This is purely additive functionality that enhances the existing language registry with automated sync capabilities.

Follow-up Work

  • Monitor first Renovate PR to verify auto-sync works
  • Review first snapshot PR after merge to main
  • Consider adding Phase 1 (manual language definitions) sync

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

mikkihugo and others added 6 commits November 12, 2025 18:41
- Add `supported_in_singularity` flag (defaults to false, explicitly true for our 24 languages)
- Add `language_type` field aligned with Linguist's classification
- Update all 24 language registrations with new fields
- Source of truth: <https://github.com/github-linguist/linguist/blob/main/lib/linguist/languages.yml>

## Governance Model
Language definitions now follow GitHub Linguist's standard:
- Prevents ad-hoc language additions
- Ensures consistency across ecosystem
- Automatic tracking via Renovate (weekly)

## Build Script Enhancement
Updated build.rs with future capability for:
- Automatic Linguist languages.yml synchronization
- Code generation from Linguist definitions
- Auto-update when Linguist adds new languages

## Renovate Configuration
- New rule to track Linguist releases (weekly)
- Labels: linguist, language-registry
- Manual review for language definition changes

This prepares Singularity for scalable language support while
maintaining explicit governance over what's actually supported.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
…cation

## What's New

FileClassifier Module: Detect vendored, generated, and binary files
- Uses patterns from GitHub Linguist (vendor.yml, generated.rb)
- Supports: vendored detection, generated file detection, binary detection
- Methods: is_vendored(), is_generated(), is_binary(), classify(), should_analyze()
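A minimal usage sketch of the classification flow described above, using a stand-in type with the same method names (the real FileClassifier is driven by the generated Linguist patterns, so its construction, pattern sets, and signatures will differ):

use std::path::Path;

// Stand-in for illustration only; method names match the commit message.
struct FileClassifier;

impl FileClassifier {
    fn is_vendored(&self, path: &Path) -> bool {
        path.components().any(|c| {
            matches!(c.as_os_str().to_str(), Some("node_modules" | "vendor" | ".yarn"))
        })
    }
    fn is_generated(&self, path: &Path) -> bool {
        let p = path.to_string_lossy();
        p.ends_with(".pb.rs") || p.ends_with(".generated.ts") || p.ends_with(".designer.cs")
    }
    fn is_binary(&self, path: &Path) -> bool {
        matches!(path.extension().and_then(|e| e.to_str()), Some("png" | "jpg" | "zip" | "exe"))
    }
    fn should_analyze(&self, path: &Path) -> bool {
        !(self.is_vendored(path) || self.is_generated(path) || self.is_binary(path))
    }
}

fn main() {
    let fc = FileClassifier;
    for p in ["src/lib.rs", "node_modules/react/index.js", "proto/api.pb.rs", "logo.png"] {
        println!("{p}: analyze = {}", fc.should_analyze(Path::new(p)));
    }
}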

Phase 1: Language Definitions - DONE
- Languages synced from Linguist languages.yml
- supported_in_singularity flag for explicit support
- Weekly Renovate alerts

Phase 2: File Classification - READY
- FileClassifier implementation complete
- Ready to auto-generate from Linguist patterns
- Supports: vendor paths, generated extensions, binary formats, documentation markers

Phase 3: Detection Heuristics - PLANNED
- Future: Auto-generate from Linguist heuristics.yml
- Fallback language detection for ambiguous extensions

New Files:
- src/file_classifier.rs: File classification engine
- LINGUIST_INTEGRATION.md: Complete documentation
- Updated build.rs: 3-phase roadmap
- Updated renovate.json5: Enhanced PR instructions

Benefits:
✅ Skip vendored code (node_modules/, vendor/)
✅ Skip generated files (.pb.rs, .generated.ts, etc.)
✅ Skip binary files (images, archives, executables)
✅ Auto-updated with Linguist releases
✅ Reduces false positives in code analysis

Testing: All tests pass, Clippy and fmt clean

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
Phase 2 Implementation: Auto-generate File Classification Patterns

New Files Added:

scripts/sync_linguist_patterns.py (200+ lines)
- Downloads vendor.yml from Linguist
- Downloads generated.rb from Linguist
- Parses YAML and Ruby code
- Extracts vendored, generated, and binary file patterns
- Generates Rust code arrays for FileClassifier

tools/linguist_sync.rs (130+ lines)
- Rust implementation roadmap
- Pattern parsing architecture
- Code generation infrastructure

Updated Files:

build.rs: Enhanced documentation
- Added manual synchronization workflow
- Documented automated (future) workflow
- Phase 2 in-progress status
- Maintenance instructions

justfile: New command
- just sync-linguist: Run Python script to sync patterns
- Provides step-by-step next actions
- Integrates into development workflow

LINGUIST_INTEGRATION.md: Detailed Phase 2 documentation
- Status: FileClassifier, Script, Integration, CI
- Manual + Automated sync workflows
- Implementation details
- Usage examples

Workflow:

For Maintainers (When Linguist Updates):
  just sync-linguist
  cargo test
  git add .
  git commit

For Automation (Future):
  cargo xtask sync-linguist

What Gets Synced:
- Vendored paths: node_modules/, vendor/, .yarn/
- Generated files: .pb.rs, .generated.ts, .designer.cs
- Binary formats: images, archives, executables
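For context, the arrays the script emits into src/file_classifier_generated.rs look roughly like the following; the constant names and regexes here are illustrative, not the generated output:

// Illustrative shape only; the real lists carry every vendored and generated
// pattern extracted from vendor.yml and generated.rb.
pub const VENDORED_PATH_PATTERNS: &[&str] = &[
    r"(^|/)node_modules/",
    r"(^|/)vendor/",
    r"(^|/)\.yarn/",
];

pub const GENERATED_FILE_PATTERNS: &[&str] = &[
    r"\.pb\.rs$",
    r"\.generated\.ts$",
    r"\.designer\.cs$",
];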

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
…tions

Complete Automation: Linguist Sync via Renovate + GitHub Actions

What's New:
- 100% Pure Rust implementation (no Python/Perl/Bash)
- GitHub Actions workflow for automatic sync
- Enhanced Cargo.toml with required dependencies
- Updated Renovate config with workflow info

Workflow:
1. Renovate detects Linguist update (weekly)
2. Creates PR automatically
3. GitHub Actions triggers sync tool
4. Downloads vendor.yml, generated.rb, heuristics.yml
5. Parses and generates src/file_classifier_generated.rs
6. Validates with cargo test
7. Auto-commits changes
8. Posts summary on PR

Phases Automated:
- Phase 2: File classification (vendor, generated, binary)
- Phase 3: Language detection heuristics (ambiguous extensions)

Files Modified:
- Cargo.toml: Added deps and bin definition
- tools/linguist_sync.rs: Full Rust implementation
- .github/workflows/sync-linguist.yml: GitHub Actions workflow
- renovate.json5: Updated PR instructions
- justfile: Updated sync command
- LINGUIST_INTEGRATION.md: Full documentation

100% Pure Rust with Renovate + GitHub Actions automation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix example usage.rs to properly load AtomicBool values with Ordering::Relaxed
- Update doctest to use `no_run` to avoid test environment issues
- Update test fixture to include all PatternSignatures fields with defaults

This ensures compatibility with the updated LanguageInfo structure where
ast_grep_supported is now an AtomicBool instead of a plain bool.
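A minimal, self-contained illustration of the change described above (the function and example values are hypothetical; the point is only the AtomicBool access pattern with Ordering::Relaxed):

use std::sync::atomic::{AtomicBool, Ordering};

fn print_support(name: &str, ast_grep_supported: &AtomicBool) {
    // Reads of the flag now go through load() instead of copying a plain bool.
    println!(
        "{name}: ast-grep supported = {}",
        ast_grep_supported.load(Ordering::Relaxed)
    );
}

fn main() {
    let flag = AtomicBool::new(false);
    print_support("rust", &flag);
    // e.g. an engine enabling support at runtime
    flag.store(true, Ordering::Relaxed);
    print_support("rust", &flag);
}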

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@qodo-code-review

qodo-code-review bot commented Nov 12, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Supply chain poisoning

Description: The tool treats a user-supplied Linguist file as trusted and writes a JSON snapshot
without validating or constraining fields (e.g., language ids, extensions), enabling
malicious or malformed inputs to poison the registry snapshot consumed by builds/CI and
potentially cause denial of service or incorrect downstream behavior.
main.rs [62-107]

Referred Code
// Read input; support JSON or YAML
let contents = std::fs::read_to_string(&input)?;
let map: serde_yaml::Value = if input.extension().and_then(|s| s.to_str()) == Some("json") {
    serde_json::from_str(&contents)?
} else {
    serde_yaml::from_str(&contents)?
};

let mut snapshots: Vec<SnapshotEntry> = Vec::new();

if let Some(obj) = map.as_mapping() {
    for (k, v) in obj {
        let id = k.as_str().unwrap_or_default().to_string();
        let name = id.clone();
        // Map some fields
        let extensions = v.get(&serde_yaml::Value::from("extensions")).and_then(|x| x.as_sequence()).map(|seq| {
            seq.iter().filter_map(|e| e.as_str().map(|s| s.to_string())).collect()
        }).unwrap_or_default();

        let aliases = v.get(&serde_yaml::Value::from("aliases")).and_then(|x| x.as_sequence()).map(|seq| {
            seq.iter().filter_map(|e| e.as_str().map(|s| s.to_string())).collect()


 ... (clipped 25 lines)
Untrusted input handling

Description: Registry initialization panics on missing/invalid SINGULARITY_LANGUAGE_SNAPSHOT and fully
trusts the JSON snapshot, loading arbitrary strings into runtime state without
schema/version validation or size limits, allowing crafted snapshots to crash builds (DoS)
or inflate memory usage.
registry.rs [246-285]

Referred Code
#[allow(
    clippy::panic,
    reason = "SINGULARITY_LANGUAGE_SNAPSHOT must be set to a valid languages JSON manifest path before initializing the registry"
)]
let snapshot_path = env::var("SINGULARITY_LANGUAGE_SNAPSHOT").unwrap_or_else(|_| {
    panic!("SINGULARITY_LANGUAGE_SNAPSHOT is not set. Provide a JSON snapshot exported from GitHub Linguist and set the env var to its path before building/releasing.");
});

let p = Path::new(&snapshot_path);
#[allow(
    clippy::manual_assert,
    reason = "Panic messages are informative for release blocker"
)]
if !p.exists() {
    #[allow(
        clippy::panic,
        reason = "Intentional panic when snapshot is missing - release blocker"
    )]
    {
        panic!("Language snapshot file not found at {snapshot_path}");
    }


 ... (clipped 19 lines)
Insecure update fetch

Description: The sync tool fetches remote content over HTTP(S) and logs retrieved byte counts but does
not pin versions, verify signatures, or enforce content-type/size limits, making it
susceptible to upstream tampering or large-response DoS during pattern synchronization.
linguist_sync.rs [24-42]

Referred Code
/// Fetch content from a URL
async fn fetch_url(url: &str) -> Result<String> {
    eprintln!("📥 Fetching {}", url);
    let client = reqwest::Client::new();
    let response = client
        .get(url)
        .timeout(std::time::Duration::from_secs(30))
        .send()
        .await
        .context(format!("Failed to fetch {}", url))?;

    let content = response
        .text()
        .await
        .context("Failed to read response body")?;

    eprintln!("✅ Fetched {} bytes", content.len());
    Ok(content)
}
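For reference, a sketch (not part of this PR) of the kind of hardening the finding asks for: pinning a Linguist tag rather than a moving branch and capping the response size. The tag handling, the 5 MiB limit, and the URL layout are illustrative assumptions:

use anyhow::{bail, Context, Result};

const MAX_BYTES: u64 = 5 * 1024 * 1024; // arbitrary cap for Linguist data files

async fn fetch_pinned(path: &str, tag: &str) -> Result<String> {
    // e.g. fetch_pinned("vendor.yml", "vX.Y.Z") with an explicitly pinned release tag
    let url = format!(
        "https://raw.githubusercontent.com/github-linguist/linguist/{tag}/lib/linguist/{path}"
    );
    let response = reqwest::Client::new()
        .get(&url)
        .timeout(std::time::Duration::from_secs(45))
        .send()
        .await
        .with_context(|| format!("Failed to fetch {url}"))?;

    if let Some(len) = response.content_length() {
        if len > MAX_BYTES {
            bail!("response for {url} is {len} bytes, over the {MAX_BYTES}-byte cap");
        }
    }
    let body = response.text().await.context("Failed to read response body")?;
    if body.len() as u64 > MAX_BYTES {
        bail!("response for {url} exceeded the size cap after download");
    }
    Ok(body)
}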
Command execution risk

Description: Falls back to spawning cargo run with arguments derived from the repository without
explicit sanitization or execution policy, which in CI can unintentionally execute
workspace code when generating snapshots, increasing supply-chain risk if the repo or
dependencies are compromised.
main.rs [50-66]

Referred Code
if !ran {
    // Fall back to invoking `cargo run` for the converter. Use a
    // manifest-path so this can be run from the workspace root.
    let status = Command::new("cargo")
        .arg("run")
        .arg("--release")
        .arg("--manifest-path")
        .arg("tools/linguist_to_snapshot/Cargo.toml")
        .arg("--")
        .arg("--input")
        .arg(&input)
        .arg("--output")
        .arg(&out)
        .status()?;
    if !status.success() {
        anyhow::bail!("cargo run for linguist_to_snapshot failed");
    }
CI execution trust

Description: The script builds and runs workspace binaries and then runs another cargo command without
verifying inputs or locking toolchain/state, which can execute arbitrary code from the
workspace in CI; although intended, this expands the attack surface if upstream inputs are
compromised.
run_generate_snapshot.sh [23-30]

Referred Code
mkdir -p canonical
cd tools/linguist_to_snapshot
cargo build --release
cd - >/dev/null

# Run wrapper which will either call built binary or `cargo run` for the converter
cargo run --manifest-path tools/generate_snapshot_job/Cargo.toml -- --output "$OUT"
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing Audit Logs: New critical actions (loading a required snapshot from env, reading/parsing files,
toggling capabilities) are not logged, making it hard to audit who/what changed registry
state or why a panic occurred.

Referred Code
#[allow(
    clippy::panic,
    reason = "SINGULARITY_LANGUAGE_SNAPSHOT must be set to a valid languages JSON manifest path before initializing the registry"
)]
let snapshot_path = env::var("SINGULARITY_LANGUAGE_SNAPSHOT").unwrap_or_else(|_| {
    panic!("SINGULARITY_LANGUAGE_SNAPSHOT is not set. Provide a JSON snapshot exported from GitHub Linguist and set the env var to its path before building/releasing.");
});

let p = Path::new(&snapshot_path);
#[allow(
    clippy::manual_assert,
    reason = "Panic messages are informative for release blocker"
)]
if !p.exists() {
    #[allow(
        clippy::panic,
        reason = "Intentional panic when snapshot is missing - release blocker"
    )]
    {
        panic!("Language snapshot file not found at {snapshot_path}");
    }


 ... (clipped 39 lines)


Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Panic On Config: Registry initialization panics on missing or invalid SINGULARITY_LANGUAGE_SNAPSHOT, which
is intentional but may preclude graceful degradation and recovery paths in some
environments.

Referred Code
#[allow(
    clippy::panic,
    reason = "SINGULARITY_LANGUAGE_SNAPSHOT must be set to a valid languages JSON manifest path before initializing the registry"
)]
let snapshot_path = env::var("SINGULARITY_LANGUAGE_SNAPSHOT").unwrap_or_else(|_| {
    panic!("SINGULARITY_LANGUAGE_SNAPSHOT is not set. Provide a JSON snapshot exported from GitHub Linguist and set the env var to its path before building/releasing.");
});

let p = Path::new(&snapshot_path);
#[allow(
    clippy::manual_assert,
    reason = "Panic messages are informative for release blocker"
)]
if !p.exists() {
    #[allow(
        clippy::panic,
        reason = "Intentional panic when snapshot is missing - release blocker"
    )]
    {
        panic!("Language snapshot file not found at {snapshot_path}");
    }


 ... (clipped 19 lines)


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Unstructured Logs: The sync tool writes unstructured status messages to stderr/stdout rather than structured
logs, which may hinder auditing and parsing in CI.

Referred Code
/// Fetch content from a URL
async fn fetch_url(url: &str) -> Result<String> {
    eprintln!("📥 Fetching {}", url);
    let client = reqwest::Client::new();
    let response = client
        .get(url)
        .timeout(std::time::Duration::from_secs(30))
        .send()
        .await
        .context(format!("Failed to fetch {}", url))?;

    let content = response
        .text()
        .await
        .context("Failed to read response body")?;

    eprintln!("✅ Fetched {} bytes", content.len());
    Ok(content)
}


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Weak Validation: External file inputs (Linguist maps) are parsed with minimal validation and default
assumptions, lacking schema checks and bounds/size limits beyond simple field handling.

Referred Code
// Read input; support JSON or YAML
let contents = std::fs::read_to_string(&input)?;
let map: serde_yaml::Value = if input.extension().and_then(|s| s.to_str()) == Some("json") {
    serde_json::from_str(&contents)?
} else {
    serde_yaml::from_str(&contents)?
};

let mut snapshots: Vec<SnapshotEntry> = Vec::new();

if let Some(obj) = map.as_mapping() {
    for (k, v) in obj {
        let id = k.as_str().unwrap_or_default().to_string();
        let name = id.clone();
        // Map some fields
        let extensions = v.get(&serde_yaml::Value::from("extensions")).and_then(|x| x.as_sequence()).map(|seq| {
            seq.iter().filter_map(|e| e.as_str().map(|s| s.to_string())).collect()
        }).unwrap_or_default();

        let aliases = v.get(&serde_yaml::Value::from("aliases")).and_then(|x| x.as_sequence()).map(|seq| {
            seq.iter().filter_map(|e| e.as_str().map(|s| s.to_string())).collect()


 ... (clipped 25 lines)


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review

qodo-code-review bot commented Nov 12, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Use a build script for codegen

Instead of loading a JSON snapshot at runtime with complex tooling, use a
build.rs script to parse the language data at compile time and generate Rust
code directly. This simplifies the architecture and improves type safety.

Examples:

src/registry.rs [227-306]
    pub fn new() -> Self {
        let mut registry = Self {
            languages: HashMap::new(),
            extension_map: HashMap::new(),
            alias_map: HashMap::new(),
            mime_map: HashMap::new(),
        };

        // In tests we keep the built-in registration for convenience. In
        // normal builds/releases we require an externally-generated JSON

 ... (clipped 70 lines)
tools/linguist_to_snapshot/src/main.rs [1-112]
use anyhow::Result;
use clap::Parser;
use serde::{Deserialize, Serialize};
use serde_json::to_writer_pretty;
use std::fs::File;
use std::path::PathBuf;

#[derive(Parser)]
struct Args {
    /// Input Linguist YAML or JSON file (languages.yml)

 ... (clipped 102 lines)

Solution Walkthrough:

Before:

// src/registry.rs
impl LanguageRegistry {
    pub fn new() -> Self {
        // In release builds, this code runs at program startup
        let snapshot_path = env::var("SINGULARITY_LANGUAGE_SNAPSHOT")
            .unwrap_or_else(|_| panic!("... env var not set ..."));

        let contents = fs::read_to_string(&snapshot_path)
            .unwrap_or_else(|_| panic!("... failed to read file ..."));

        let snapshots: Vec<LanguageInfoSnapshot> = serde_json::from_str(&contents)
            .unwrap_or_else(|_| panic!("... failed to parse JSON ..."));

        let mut registry = Self::new_empty();
        for snap in snapshots {
            // Convert from snapshot struct to main struct
            registry.register_language(LanguageInfo::from(snap));
        }
        registry
    }
}

After:

// build.rs
fn main() {
    // This code runs at compile time
    let out_dir = env::var("OUT_DIR").unwrap();
    let dest_path = Path::new(&out_dir).join("languages.rs");

    let yaml_content = fs::read_to_string("path/to/languages.yml").unwrap();
    let languages_map: HashMap<String, LinguistEntry> = serde_yaml::from_str(&yaml_content).unwrap();

    let mut rust_code = "[\n".to_string();
    for (id, entry) in languages_map {
        // Generate Rust struct literals directly
        rust_code.push_str(&format!("    LanguageInfo {{ id: \"{}\", ... }},\n", id));
    }
    rust_code.push_str("]");
    fs::write(&dest_path, rust_code).unwrap();
}

// src/registry.rs
const BUILTIN_LANGUAGES: &[LanguageInfo] = &include!(concat!(env!("OUT_DIR"), "/languages.rs"));
Suggestion importance[1-10]: 9


Why: This is a high-impact architectural suggestion that proposes a simpler, more robust, and idiomatic Rust solution, correctly identifying the significant complexity introduced by the PR's runtime-based approach.

High
General
Use struct deserialization for parsing

Refactor the manual parsing of serde_yaml::Value to use serde_yaml::from_value
for direct deserialization into the LinguistEntry struct, simplifying the code.

tools/linguist_to_snapshot/src/main.rs [74-104]

-let id = k.as_str().unwrap_or_default().to_string();
-let name = id.clone();
-// Map some fields
-let extensions = v.get(&serde_yaml::Value::from("extensions")).and_then(|x| x.as_sequence()).map(|seq| {
-    seq.iter().filter_map(|e| e.as_str().map(|s| s.to_string())).collect()
-}).unwrap_or_default();
+let name = k.as_str().unwrap_or_default().to_string();
+let id = name.to_lowercase().replace(' ', "-");
 
-let aliases = v.get(&serde_yaml::Value::from("aliases")).and_then(|x| x.as_sequence()).map(|seq| {
-    seq.iter().filter_map(|e| e.as_str().map(|s| s.to_string())).collect()
-}).unwrap_or_default();
-
-let mime_types = v.get(&serde_yaml::Value::from("mime_types")).and_then(|x| x.as_sequence()).map(|seq| {
-    seq.iter().filter_map(|e| e.as_str().map(|s| s.to_string())).collect()
-}).unwrap_or_default();
-
-let tree_sitter_language = v.get(&serde_yaml::Value::from("tree_sitter_language")).and_then(|x| x.as_str()).map(|s| s.to_string());
+let entry: LinguistEntry = match serde_yaml::from_value(v.clone()) {
+    Ok(e) => e,
+    Err(_) => continue, // Skip entries that don't match our struct
+};
 
 snapshots.push(SnapshotEntry {
     id,
     name,
-    extensions,
-    aliases,
-    tree_sitter_language,
+    extensions: entry.extensions.unwrap_or_default(),
+    aliases: entry.aliases.unwrap_or_default(),
+    tree_sitter_language: entry.tree_sitter_language,
     rca_supported: false,
-    ast_grep_supported: true,
-    mime_types,
+    ast_grep_supported: true, // Default to true for now
+    mime_types: entry.mime_types.unwrap_or_default(),
     family: None,
     is_compiled: false,
-    language_type: "programming".to_string(),
+    language_type: entry._type.unwrap_or_else(|| "programming".to_string()),
     pattern_signatures: serde_json::Value::Null,
 });
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that using serde_yaml::from_value is more idiomatic and robust than manual field extraction, improving code quality and maintainability.

Medium
Possible issue
Improve JSON validation to check all elements
Suggestion Impact: The commit changed the jq check from validating only the first element to validating all elements, adding even stricter checks (non-empty string types) and improved error output.

code diff:

+          # ensure every object has non-empty string id and name fields
+          if ! jq -e 'all(.[]; (has("id") and has("name") and (.id|type=="string") and (.name|type=="string") and (.id != "") and (.name != "")))' "$OUT" >/dev/null; then
+            echo "Snapshot validation failed: one or more entries are missing required fields 'id' or 'name', or they are empty/non-string"
+            # Show up to first 5 offending entries
+            jq 'map(select( (has("id")|not) or (has("name")|not) or (.id|type!="string") or (.name|type!="string") or (.id=="") or (.name=="") )) | .[0:5]' "$OUT" || true

Improve the jq validation to check that all elements in the snapshot JSON array
are objects with id and name fields, not just the first element.

.github/workflows/publish-snapshot.yml [76-81]

 # ensure each object has id and name fields
-if ! jq -e '.[0] | has("id") and has("name")' "$OUT" >/dev/null; then
-  echo "Snapshot entries appear to be missing required fields (id/name)"
-  jq '.[0]' "$OUT" || true
+if ! jq -e 'all(type == "object" and has("id") and has("name"))' "$OUT" >/dev/null; then
+  echo "Snapshot is invalid: entries must be objects with id and name fields"
+  jq . "$OUT" || true
   exit 1
 fi

[Suggestion processed]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a flaw in the validation logic where only the first element of the JSON array is checked, and it provides a more robust solution using jq's all function to validate every element.

Medium

Use cargo:notice= instead of cargo:warning= for successful validation
messages. This prevents successful builds from showing as warnings when
the validation actually completed successfully.

Only use cargo:warning= for actual issues and errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 54 to +131
#[non_exhaustive]
pub struct LanguageInfo {
/// Unique language identifier (e.g., `"rust"`, `"elixir"`)
/// Derived from GitHub Linguist language names (lowercased)
pub id: String,
/// Human-readable language name (e.g., `"Rust"`, `"Elixir"`)
pub name: String,
/// File extensions for this language (e.g., `rs`, or `ex`/`exs`)
/// Source: GitHub Linguist
pub extensions: Vec<String>,
/// Alternative names/aliases (e.g., `js`, `javascript`)
pub aliases: Vec<String>,
/// Whether this language is supported by Singularity's parsing engine
/// Default: false (only explicitly supported languages are true)
pub supported_in_singularity: bool,
/// Tree-sitter language name (if supported)
pub tree_sitter_language: Option<String>,
/// Whether RCA (rust-code-analysis) supports this language
pub rca_supported: bool,
/// Whether AST-Grep supports this language
pub ast_grep_supported: bool,
pub rca_supported: AtomicBool,
/// Whether AST-Grep supports this language (set at runtime by engines)
pub ast_grep_supported: AtomicBool,
/// MIME types for this language
pub mime_types: Vec<String>,
/// Language family (e.g., "BEAM", "C-like", "Web")
pub family: Option<String>,
/// Whether this is a compiled or interpreted language
pub is_compiled: bool,
/// Language type from Linguist: "programming", "markup", "data", "prose"
pub language_type: String,
/// Pattern signatures for cross-language pattern detection
#[serde(default)]
pub pattern_signatures: PatternSignatures,
/// Dynamic capability bits controlled by downstream engines
#[serde(skip)]
pub capabilities: AtomicU32,


P0: Derive adds serde bounds missing for atomic fields

The newly added atomics in LanguageInfo are still deriving Serialize/Deserialize, but AtomicBool and AtomicU32 do not implement those serde traits. The derive therefore cannot compile – the compiler will emit `the trait Serialize is not implemented for AtomicBool` (same for AtomicU32). Because this struct is used throughout the crate, the entire crate fails to build. Either drop the serde derives from LanguageInfo and rely on the new LanguageInfoSnapshot, or provide custom serialization helpers for the atomic fields.
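One way to satisfy the derive (a sketch, not this PR's actual fix): keep Serialize/Deserialize but route the atomic fields through serialize_with/deserialize_with helpers, or skip them entirely. The struct below is a trimmed, hypothetical stand-in for LanguageInfo:

use serde::{Deserialize, Deserializer, Serialize, Serializer};
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

// Helpers so `#[serde(with = "atomic_bool")]` works on AtomicBool fields.
mod atomic_bool {
    use super::*;

    pub fn serialize<S: Serializer>(v: &AtomicBool, s: S) -> Result<S::Ok, S::Error> {
        s.serialize_bool(v.load(Ordering::Relaxed))
    }

    pub fn deserialize<'de, D: Deserializer<'de>>(d: D) -> Result<AtomicBool, D::Error> {
        bool::deserialize(d).map(AtomicBool::new)
    }
}

#[derive(Serialize, Deserialize)]
struct LanguageInfoSketch {
    id: String,
    #[serde(with = "atomic_bool")]
    rca_supported: AtomicBool,
    #[serde(with = "atomic_bool")]
    ast_grep_supported: AtomicBool,
    // Runtime-only capability bits: skipped entirely, rebuilt via Default.
    #[serde(skip)]
    capabilities: AtomicU32,
}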


@github-actions
Contributor

🔍 Automated Checks

🔍 Checking for stale files and out-of-scope changes...

Stale File Check

✅ No stale files detected

Scope Check

Checking file relevance (blocks binaries, temp files, etc.)...

✅ All changes appear relevant (includes .github/ workflows, src/, docs, config)

ℹ️ Note: 1 .github/ file(s) changed - workflows/actions are critical infrastructure


Claude is reviewing the code... Check the "Claude Code Review" step for detailed feedback.

…uild

Replace openssl-sys with pure Rust rustls-tls backend for reqwest.
This allows sync-linguist binary to build without system OpenSSL libraries,
enabling it to work in CI/CD environments without nix develop.

- Changed reqwest to use rustls-tls feature
- Disabled default-tls (OpenSSL) feature
- Resolves CI/CD build failures for sync-linguist binary

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

mikkihugo and others added 4 commits November 12, 2025 20:58
This commit significantly improves the linguist sync tool (Phase 2 & 3):

## Tool Improvements
- Add proper logging support (log, env_logger)
- Replace regex-based parsing with serde YAML deserialization
- Add proper data structures for heuristics (Disambiguation, Rule, etc.)
- Improve error handling with anyhow::Context
- Write files directly to src/ instead of stdout redirection
- Increase fetch timeout from 30s to 45s for reliability
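For reference, a sketch of the structured types this refactor implies. Disambiguation and Rule are named above; the field layout mirrors Linguist's heuristics.yml as published upstream, but only the parts the sync tool needs are modeled here and the tool's actual definitions may differ:

use anyhow::{Context, Result};
use serde::Deserialize;
use std::collections::HashMap;

#[derive(Debug, Deserialize)]
struct Heuristics {
    disambiguations: Vec<Disambiguation>,
    #[serde(default)]
    named_patterns: HashMap<String, serde_yaml::Value>,
}

#[derive(Debug, Deserialize)]
struct Disambiguation {
    extensions: Vec<String>,
    rules: Vec<Rule>,
}

#[derive(Debug, Deserialize)]
struct Rule {
    // A rule can name a single language or a list of candidates.
    language: serde_yaml::Value,
    #[serde(default)]
    pattern: Option<serde_yaml::Value>,
    #[serde(default)]
    named_pattern: Option<String>,
}

fn parse_heuristics(yaml: &str) -> Result<Heuristics> {
    serde_yaml::from_str(yaml).context("failed to parse heuristics.yml")
}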

## Generated Files
- src/file_classifier_generated.rs (7.8K)
  - 167 vendored code patterns from vendor.yml
  - 82 generated file patterns from generated.rb
- src/heuristics_generated.rs (117K)
  - 124 disambiguation groups from heuristics.yml
  - 21 named patterns
  - Full rule-based language detection support

## Workflow Updates
- Update sync-linguist.yml to remove stdout redirect
- Track both generated files in commits
- Update documentation to mention both outputs

## Testing
- All 17 tests pass
- Tool successfully fetches and parses latest Linguist data
- Deterministic output (idempotent runs)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds comprehensive language metadata synchronization (Phase 4)
to complement the existing pattern sync (Phases 2 & 3).

## New Features
- Download and parse languages.yml (157KB, 789 languages)
- Generate Rust types with full language metadata:
  - Extensions, filenames, interpreters
  - Syntax highlighting modes (ace_mode, tm_scope, codemirror)
  - Visual metadata (colors, aliases)
  - Language categorization (type, group)
  - Editor configuration (wrap, fs_name)
- Save raw languages.yml to .github/linguist/ for snapshot workflow

## Generated Files
- src/languages_metadata_generated.rs (448KB)
  - `LanguageMetadata` struct with all Linguist fields
  - `LANGUAGES` const array with 789 language definitions
- .github/linguist/languages.yml (154KB)
  - Raw YAML for publish-snapshot workflow

## Workflow Updates
- Update sync-linguist.yml to commit all 4 generated files
- Update documentation to mention Phase 4
- Update PR comments to show complete sync status

## Architecture
The tool now provides both:
1. Rust const data (embedded in binary) for performance
2. Raw YAML (for external tooling and snapshot generation)

This gives downstream consumers flexibility to choose their integration approach.

Co-Authored-By: Claude <noreply@anthropic.com>

@mikkihugo mikkihugo changed the title from "ci(snapshot): publish canonical linguist-derived snapshot via PR" to "feat: Complete Linguist sync automation and snapshot publishing workflow" on Nov 14, 2025
Release highlights:
- Complete Linguist sync automation (Phases 2, 3 & 4)
- 789 languages with full metadata
- Automated snapshot publishing workflows
- Enhanced development infrastructure

See CHANGELOG.md for full details.

The workspace configuration puts binaries in target/release/, not
tools/*/target/release/. Updated both workflows to use the correct path.

Use -p linguist_to_snapshot instead of --bin to properly build
workspace member binaries.

@mikkihugo mikkihugo enabled auto-merge (squash) November 14, 2025 08:21

@qodo-code-review

qodo-code-review bot commented Nov 14, 2025

CI Feedback 🧐

(Feedback updated until commit c44b5f4)

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: CI Success

Failed stage: Check All Jobs [❌]

Failure summary:

The action failed due to a gating step that exits on failed critical checks:
- The conditional block `if [[ "failure" != "success" ]]; then` evaluated to true, triggering the message "❌ Nix checks failed" and calling `exit 1`.
- As a result, the job terminated with exit code 1 before printing "✅ All critical checks passed!".
- This indicates the prior Nix-related checks were marked as failure in the workflow's environment/status, causing the failure gate to stop the job.

Relevant error logs:
1:  ##[group]Runner Image Provisioner
2:  Hosted Compute Agent
...

26:  Metadata: read
27:  Models: read
28:  Packages: write
29:  Pages: write
30:  PullRequests: write
31:  RepositoryProjects: write
32:  SecurityEvents: write
33:  Statuses: write
34:  ##[endgroup]
35:  Secret source: Actions
36:  Prepare workflow directory
37:  Prepare all required actions
38:  Complete job name: CI Success
39:  ##[group]Run if [[ "failure" != "success" ]]; then
40:  if [[ "failure" != "success" ]]; then
41:    echo "❌ Nix checks failed"
42:    exit 1
43:  fi
44:  if [[ "skipped" != "success" && "skipped" != "skipped" ]]; then
45:    echo "❌ MSRV check failed"
46:    exit 1
47:  fi
48:  echo "✅ All critical checks passed!"
49:  shell: /usr/bin/bash -e {0}
50:  ##[endgroup]
51:  ❌ Nix checks failed
52:  ##[error]Process completed with exit code 1.
53:  Cleaning up orphan processes

@mikkihugo mikkihugo disabled auto-merge November 15, 2025 10:59
